Tuna-2: Pixel Embeddings Beat Vision Encoders for Multimodal Understanding and Generation